Book Review | New Books September 2003

Author

  • GW Lucier
Abstract

Background: In class prediction problems using microarray data, gene selection is essential to improve the prediction accuracy and to identify potential marker genes for a disease. Among numerous existing methods for gene selection, support vector machine-based recursive feature elimination (SVM-RFE) has become one of the leading methods and is widely used. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin. However, its performance can easily be affected by noise and outliers when it is applied to noisy, small-sample-size microarray data.

Results: In this paper, we propose a recursive gene selection method using the discriminant vector of the maximum margin criterion (MMC), which is a variant of classical linear discriminant analysis (LDA). To overcome the computational drawback of classical LDA and the problem of high dimensionality, we present efficient and stable algorithms for MMC-based RFE (MMC-RFE). The MMC-RFE algorithms naturally extend to multi-class cases. The performance of MMC-RFE was extensively compared with that of SVM-RFE using nine cancer microarray datasets, including four multi-class datasets.

Conclusion: Our extensive comparison has demonstrated that for binary-class datasets MMC-RFE tends to show intermediate performance between hard-margin SVM-RFE and SVM-RFE with a properly chosen soft-margin parameter. Notably, MMC-RFE achieves significantly better performance with a smaller number of genes than SVM-RFE for multi-class datasets. The results suggest that MMC-RFE is less sensitive to noise and outliers due to the use of the average margin, and thus may be useful for biomarker discovery from noisy data.

Published: 25 December 2006. Received: 27 July 2006; Accepted: 25 December 2006. BMC Bioinformatics 2006, 7:543. doi:10.1186/1471-2105-7-543. This article is available from: http://www.biomedcentral.com/1471-2105/7/543. © 2006 Niijima and Kuhara; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Microarray technology allows us to measure the expression levels of thousands of genes simultaneously. The vast amount of data produced by microarrays poses a great challenge to conventional data mining and machine learning methods, because the number of genes often exceeds tens of thousands, whereas the number of samples is at most a few hundred. Along with clustering and classification of genes and/or samples, gene selection is an important aspect of microarray data analysis, and has been a central issue in recent years [1,2]. Specifically, gene selection is used to identify the genes most relevant to sample classification, for example, those that differentiate between normal and cancerous tissue samples.

Gene selection plays essential roles in classification tasks. It improves the prediction accuracy of classifiers by using only discriminative genes. It also saves computational costs by reducing dimensionality. More importantly, if it is possible to identify a small subset of biologically relevant genes, it may provide insights into understanding the underlying mechanism of a specific biological phenomenon. Also, such information can be useful for designing less expensive experiments by targeting only a handful of genes.

The most common gene selection approach is so-called gene ranking. It is a univariate approach in the sense that each gene is evaluated individually with respect to a certain criterion that represents class discrimination ability. Commonly used criteria include t-statistics, the signal-to-noise (S2N) ratio [3,4], and the between-group to within-group (BW) ratio [5]. Although such gene ranking criteria are simple to use, they ignore correlations or interactions among genes, which may be essential to class discrimination and characterization.
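
To make the univariate ranking step concrete, the following is a minimal sketch of S2N scoring for a binary-class expression matrix. The function name, the NumPy implementation, and the small constant added to the denominator are our own choices for illustration; they are not code from the paper.

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise ratio per gene for a binary-class problem.

    X : (n_samples, n_genes) expression matrix
    y : (n_samples,) labels containing exactly two classes
    Returns |mu1 - mu2| / (sigma1 + sigma2) for each gene.
    """
    classes = np.unique(y)
    assert len(classes) == 2, "S2N is defined for binary-class problems"
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    sd1, sd2 = X1.std(axis=0, ddof=1), X2.std(axis=0, ddof=1)
    return np.abs(mu1 - mu2) / (sd1 + sd2 + 1e-12)  # small constant guards against zero variance

# Univariate selection: rank genes by decreasing score and keep the top k.
# top_k = np.argsort(s2n_scores(X, y))[::-1][:100]
```

A multi-class criterion such as the BW ratio would replace the two-class score with a ratio of between-group to within-group sums of squares, but the ranking step is otherwise identical.
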
Among existing gene selection methods, support vector machine-based recursive feature elimination (SVM-RFE) [6] has become one of the leading methods and is widely used. It is a multivariate approach, hence correlations among genes can be taken into account. Moreover, since the selection is based on an SVM classifier, a subset of genes that yields high classification performance can be identified. Recently, the successful application of SVM-RFE has motivated the development of several SVM-based gene selection methods [7-9]. The SVM-based approach performs gene selection using the weight vector of the hyperplane constructed by the samples on the margin, i.e. the support vectors. However, while this property may be crucial for achieving good generalization performance, the effect of using support vectors on gene selection remains unclear, especially when the method is applied to noisy, small-sample-size microarray data. A recent work by Li and Yang [10] implies that penalizing redundant genes only for the samples on the margin may lead to poorer performance.

In this paper, we propose a recursive gene selection method based on the maximum margin criterion (MMC) [11], which is a variant of classical linear discriminant analysis (LDA). Guyon et al. [6] compared the performance of SVM-RFE and classical LDA-based RFE (LDA-RFE), and claimed that the use of support vectors is critical in eliminating irrelevant genes. However, their comparison is insufficient in the following respects:

• For computational reasons, LDA-RFE was performed by eliminating half of the genes at each iteration, whereas SVM-RFE eliminated one gene at a time.
• Cross-validation was performed improperly [12].
• The comparison was made on only a single dataset.

The computational drawback of classical LDA limits the use of LDA-RFE for gene selection. This paper presents efficient and stable algorithms for MMC-based RFE (MMC-RFE), which overcome the singularity problem of classical LDA and the problem of high dimensionality. To validate the effectiveness of MMC-RFE, we extensively compare its performance with that of SVM-RFE using nine cancer microarray datasets.
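
As background on the criterion itself, MMC seeks projection directions maximizing tr(W^T (S_b - S_w) W), so the discriminant vectors are leading eigenvectors of S_b - S_w and no matrix inverse is needed, in contrast to classical LDA's S_w^{-1} S_b. The sketch below is an assumed baseline formulation that naively forms the d x d scatter matrices (impractical when d is tens of thousands of genes); it is not the uncorrelated or orthogonal MMC-RFE algorithm described in the Methods section.

```python
import numpy as np

def mmc_directions(X, y, n_components=1):
    """Leading discriminant directions under the maximum margin criterion.

    Maximizes tr(W^T (S_b - S_w) W); the solution is given by the top
    eigenvectors of S_b - S_w, so no inverse of S_w is computed, which
    avoids the singularity problem of classical LDA on small-sample data.
    """
    n, d = X.shape
    overall_mean = X.mean(axis=0)
    S_b = np.zeros((d, d))  # between-class scatter
    S_w = np.zeros((d, d))  # within-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - overall_mean)[:, None]
        S_b += (len(Xc) / n) * diff @ diff.T
        S_w += (len(Xc) / n) * np.cov(Xc, rowvar=False, bias=True)
    # Symmetric eigendecomposition; keep eigenvectors with the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(S_b - S_w)
    return eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
```

The efficient algorithms presented in the paper avoid forming these d x d matrices explicitly; the snippet is only meant to show why the criterion is inversion-free.
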
Results and discussion

Datasets

In this study, we used nine public datasets of cancer microarrays. Five of the datasets concern binary-class prediction problems: normal versus tumor for Colon cancer [13] and Prostate cancer [14], ALL versus AML for Leukemia [3], and clinical outcome for Medulloblastoma [15] and Breast cancer [16]. Four of the datasets are multi-class subtype prediction problems: MLL [17], SRBCT [18], CNS [15], and NCI60 [19]. The details of these datasets are described below.

Colon cancer dataset [13]: This Affymetrix high-density oligonucleotide array dataset contains 62 samples from 2 classes of colon-cancer patients: 40 tumor samples and 22 normal healthy samples. The expression profiles of 2000 genes are used. The dataset is publicly available at [20].

Prostate cancer dataset [14]: This Affymetrix high-density oligonucleotide array dataset contains 102 samples from 2 classes: 50 normal tissue samples and 52 prostate tumor samples. The expression profiles of 12600 genes are used. The dataset is publicly available at [21].

Leukemia dataset [3]: This Affymetrix high-density oligonucleotide array dataset contains 38 samples from 2 classes of leukemia: 27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML). The expression profiles of 7129 genes are used. The dataset is publicly available at [21]. A further 34 samples, consisting of 20 ALL and 14 AML, are used as an independent test set, as mentioned later.

Medulloblastoma dataset [15]: This Affymetrix high-density oligonucleotide array dataset contains 60 samples from 2 classes on patient survival with medulloblastoma: 21 treatment failures and 39 survivors. The expression profiles of 7129 genes are used. The dataset is publicly available at [21].

Breast cancer dataset [16]: This cDNA microarray dataset contains 76 samples from 2 classes on five-year metastasis-free survival: 33 poor prognosis and 43 good prognosis. The expression profiles of 4918 genes are used. The dataset is publicly available at [22]. A further 19 samples, with 12 poor prognosis and 7 good prognosis, are used as an independent test set, as mentioned later.

MLL dataset [17]: This Affymetrix high-density oligonucleotide array dataset contains 57 samples from 3 classes of leukemia: 20 acute lymphoblastic leukemia (ALL), 17 mixed-lineage leukemia (MLL), and 20 acute myelogenous leukemia (AML). The expression profiles of 12582 genes are used. The dataset is publicly available at [21]. Note that a test dataset consisting of 15 samples is not used here.

SRBCT dataset [18]: This cDNA microarray dataset contains 63 samples from 4 classes of small round blue-cell tumors of childhood (SRBCT): 23 Ewing family of tumors, 20 rhabdomyosarcoma, 12 neuroblastoma, and 8 non-Hodgkin lymphoma. The expression profiles of 2308 genes are used. The dataset is publicly available at [23]. Note that a test dataset consisting of 20 SRBCT and 5 non-SRBCT samples is also available, but is not used here.

CNS dataset [15]: This Affymetrix high-density oligonucleotide array dataset contains 42 samples from 5 different tumors of the central nervous system (CNS): 10 medulloblastomas, 10 malignant gliomas, 10 atypical teratoid/rhabdoid tumors, 8 primitive neuro-ectodermal tumors, and 4 human cerebella. The expression profiles of 7129 genes are used. The dataset is publicly available at [21].

NCI60 dataset [19]: This cDNA microarray dataset contains 61 samples from 8 classes of human tumor cell lines: 9 breast, 5 CNS, 7 colon, 8 leukemia, 8 melanoma, 9 non-small cell lung carcinoma, 6 ovarian, and 9 renal tumors. The expression profiles of 3938 genes are used. The dataset is publicly available at [24].

Preprocessing

For the Prostate cancer, Leukemia, Medulloblastoma, MLL, and CNS datasets, expression values were first thresholded with a floor of 100 (10 for Prostate cancer) and a ceiling of 16000, followed by a base 10 logarithmic transform. Then, each sample was standardized to zero mean and unit variance across genes. For the Colon cancer dataset, each sample was standardized after a base 10 logarithmic transform. For the Breast cancer dataset, each sample was standardized after filtering genes following [16]. For the NCI60 dataset, a base 2 logarithmic transform and standardization were applied after filtering out genes with missing values. For the SRBCT dataset, the expression profiles already preprocessed following [18] were used.
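
The thresholding, log transform, and per-sample standardization steps can be written compactly. The following is a minimal sketch assuming a samples-by-genes NumPy array; it is our own illustration of the pipeline for the thresholded datasets, not the authors' actual preprocessing code.

```python
import numpy as np

def preprocess(X, floor=100.0, ceiling=16000.0, log_base=10.0):
    """Threshold, log-transform, and standardize each sample across genes.

    X : (n_samples, n_genes) raw expression matrix.
    Values are clipped to [floor, ceiling], log-transformed in the given base,
    and each sample (row) is then scaled to zero mean and unit variance.
    """
    X = np.clip(X, floor, ceiling)
    X = np.log(X) / np.log(log_base)
    row_mean = X.mean(axis=1, keepdims=True)
    row_std = X.std(axis=1, keepdims=True)
    return (X - row_mean) / row_std

# e.g. Leukemia-style datasets: preprocess(X, floor=100, ceiling=16000)
#      Prostate cancer dataset: preprocess(X, floor=10,  ceiling=16000)
```
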
Gene selection methods for comparison

As a baseline gene selection criterion, we employed the S2N ratio [4] for binary-class problems and the BW ratio [5] for multi-class problems. Top-ranked genes with the largest ratios were used for classification. We primarily compared two algorithms for MMC-RFE, called uncorrelated MMC-RFE and orthogonal MMC-RFE (see Methods), with SVM-RFE. For the SVM classifier, we used both hard-margin SVM and soft-margin SVM with a linear kernel. The effect of using support vectors on gene selection may be directly evaluated with hard-margin SVM, i.e. by setting the soft-margin parameter C to infinity. The use of soft-margin SVM can alleviate the influence of noise and outliers to some extent and avoid overfitting of the data, with a trade-off between training errors and the margin. In the experiments, we used a wide range of values for the C parameter: C = {0.001, 0.01, 0.1, 1, 10, 100, 1000}.

The extension of SVM to more than two classes is not obvious. Hence, several approaches have been proposed for multi-class SVMs, of which we employed one-versus-all SVM (OVASVM). Ramaswamy et al. [25] showed the effectiveness of the OVASVM approach for gene selection and classification, and Weston et al. [8] also applied it to gene selection in multi-class problems. In this study, OVASVM-based RFE was performed in the same way as in [8]. For the implementation of SVM-RFE, we used the Spider library for MATLAB, which is publicly available from [26].

Performance evaluation

We assessed the performance of each gene selection method by repeated random splitting: the samples were partitioned randomly, in a class-proportional manner, into a training set consisting of two-thirds of the samples and a test set consisting of the held-out one-third. To avoid selection bias, gene selection was performed using only the training set, and the classification error rate of the learnt classifier was obtained on the test set. This splitting was repeated 100 times. The error rates averaged over the 100 trials and the corresponding standard errors are reported.

As a baseline classification method, we employed the nearest mean classifier (NMC), which has been found effective for cancer classification [27]. We combined each gene selection method with NMC. Although the nearest neighbor classifier (NNC) was applied as well, NMC consistently showed favorable performance compared with NNC in the repeated random splitting experiments, and thus only the results for NMC are reported here. While the gene selection methods can be compared fairly by using the same classifier, SVM-RFE is often used as an integrated method of gene selection and classification, and MMC-RFE may also perform better when used with the MMC classifier (see Methods). With this in view, we further compared SVM-RFE in combination with the SVM classifier against MMC-RFE in combination with the MMC classifier. For multi-class datasets, the OVASVM classifier was used.
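
A minimal sketch of this evaluation protocol is given below, using scikit-learn's StratifiedShuffleSplit for the class-proportional 2/3-1/3 splits and NearestCentroid as a stand-in for the nearest mean classifier. The select_genes callback is a hypothetical placeholder for whichever selection criterion is being evaluated; crucially, it sees only the training split, so no selection bias is introduced.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import NearestCentroid  # nearest mean classifier

def evaluate(X, y, select_genes, n_genes=100, n_splits=100, seed=0):
    """Average test error over repeated class-proportional 2/3 - 1/3 splits.

    select_genes(X_train, y_train, n_genes) must return the column indices of
    the selected genes, computed from the training split only.
    """
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=1/3,
                                      random_state=seed)
    errors = []
    for train_idx, test_idx in splitter.split(X, y):
        genes = select_genes(X[train_idx], y[train_idx], n_genes)
        clf = NearestCentroid().fit(X[train_idx][:, genes], y[train_idx])
        errors.append(1.0 - clf.score(X[test_idx][:, genes], y[test_idx]))
    errors = np.asarray(errors)
    return errors.mean(), errors.std(ddof=1) / np.sqrt(n_splits)  # mean error, standard error
```
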
As suggested by Weston et al. [8], to save computational time in RFE, we removed half of the genes at each iteration until fewer than 1000 remained, and then removed a single gene at a time. In this study, we do not address the problem of finding the optimal number of genes that would yield the highest classification accuracy. Instead, the number of genes was varied from 1 to 100, and the performances were compared at each number of genes.
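
A minimal sketch of linear SVM-RFE with this elimination schedule is shown below, using scikit-learn's LinearSVC and ranking genes by the squared weights of the fitted hyperplane. This is an assumed re-implementation for illustration only, not the Spider-based code used in the paper; a hard-margin SVM would correspond to a very large C.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_genes=100, C=1.0, switch_at=1000):
    """Recursive feature elimination with a linear SVM.

    Genes are scored by the squared hyperplane weights w_j^2. Half of the
    remaining genes are dropped per iteration while more than `switch_at`
    remain; after that, a single gene is removed at a time.
    """
    remaining = np.arange(X.shape[1])
    while len(remaining) > n_genes:
        clf = LinearSVC(C=C, max_iter=10000).fit(X[:, remaining], y)
        scores = (clf.coef_ ** 2).sum(axis=0)  # sum over classes for multi-class (one-vs-all)
        n_drop = len(remaining) // 2 if len(remaining) > switch_at else 1
        n_drop = min(n_drop, len(remaining) - n_genes)
        remaining = remaining[np.argsort(scores)[n_drop:]]  # keep the highest-scoring genes
    return remaining
```
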
Performance comparison for binary-class datasets

Tables 1 and 2 show the average error and standard error rates of each combination of classifier and gene selection criterion for the binary-class datasets: Colon cancer, Prostate cancer, Leukemia, Medulloblastoma, and Breast cancer. Figures 1 and 2 plot the average error rates as a function of the number of genes from 1 to 100. In the tables and figures, MMC-RFE(U), MMC-RFE(O), SVM-RFE(H) and SVM-RFE(S) stand for uncorrelated MMC-RFE, orthogonal MMC-RFE, hard-margin SVM-RFE and soft-margin SVM-RFE, respectively. For SVM-RFE(S), the best result with respect to the C parameter is shown. Our observations from these results are as follows:

• NMC+MMC-RFE(U,O) versus NMC+SVM-RFE(H,S) – Overall, MMC-RFE(U,O) shows intermediate performance between SVM-RFE(H) and SVM-RFE(S) with the best C parameter. MMC-RFE(O) is consistently better than MMC-RFE(U), and notably MMC-RFE(O) performs best for Leukemia. In most cases, however, the difference is not significant and they are quite competitive.

• MMC+MMC-RFE(U,O) versus SVM+SVM-RFE(H,S) – The performance of MMC-RFE(U,O) is improved for Prostate cancer. For the other datasets, the trend is similar to the case of using NMC.

• S2N versus MMC-RFE(U,O) and SVM-RFE(H,S) – Both MMC-RFE(U,O) and SVM-RFE(H,S) improve the performance of NMC over S2N for Prostate cancer, Leukemia and Medulloblastoma.

Wessels et al. [27] have reported that NMC with S2N performs best among various combinations of gene selection methods and classifiers for Colon cancer and Breast cancer. Consistent with their results, S2N performs better than SVM-RFE(H) for these datasets. However, a significant improvement is achieved for SVM-RFE(S) by setting the C parameter to a small value, e.g. 0.001. Huang and Kecman [28] also reported that finer tuning of the C parameter can significantly improve the performance of SVM-RFE.

Guyon et al. [6] concluded from their result on the Colon cancer dataset that SVM-RFE performs better than both S2N and LDA-RFE. In their experiment, the C parameter was set to 100. However, SVM-RFE(S) with C = 100 gives almost the same error rate as SVM-RFE(H) for all the binary-class datasets in our study, and its performance is poorer than that of S2N for Colon cancer, as mentioned previously. There are several reasons that may account for this contradiction. First, although Guyon et al. [6] used SVM and weighted voting [3] for classification, we have found that for the Colon cancer dataset, SVM with C = 100 performs significantly worse than NMC when combined with S2N. As can be seen from Table 1, NMC+SVM-RFE(H) even compares favorably against SVM+SVM-RFE(H). Second, the contradiction can be attributed to the selection bias caused by their improper use of cross-validation [12]; they failed to include the gene selection process in the cross-validation. Finally, the performance difference between LDA-RFE and SVM-RFE may be due to the difference in the number of genes eliminated at a time.

Guyon et al. [6] also compared the performance of mean squared error-based RFE (MSE-RFE) and SVM-RFE, and claimed the superiority of SVM-RFE. However, our results suggest that MSE-RFE might also show better performance in some cases. Indeed, this has been implied by the work of Li and Yang [10], which showed that ridge regression-based RFE performed better than SVM-RFE. It should be noted that MSE is closely related to classical LDA and ridge regression [29,30]. MMC-RFE is still advantageous over LDA-RFE and MSE-RFE, because MMC-RFE does not need to compute the inverse of a matrix, which makes it a computationally efficient and stable method.

As our results indicate, the prediction of clinical outcome is generally more difficult than that of tissue or disease types. The error rates of NMC with S2N for the clinical outcome datasets (Medulloblastoma and Breast cancer) almost coincide with the results presented in [31], which performed a comparative study on outcome prediction using the same validation strategy as our study. The result for Medulloblastoma shows that the prediction performance can be improved by multivariate gene selection methods such as MMC-RFE and SVM-RFE. However, it is

Table 1: Performance comparison for binary-class datasets. Rows: classifier + selection criterion; columns: number of genes.

Journal title:
  • Environmental Health Perspectives

Volume 111, Issue

Pages: -

Publication date: 2003